Identifying long-term survivors and those at higher or lower risk of relapse among patients with cytogenetically normal acute myeloid leukemia using a high-dimensional mixture cure model

Patients with cytogenetically normal acute myeloid leukemia (CN-AML) may harbor prognostically relevant gene mutations and thus be categorized into one of the three 2022 European LeukemiaNet (ELN) genetic-risk groups. Nevertheless, there remains heterogeneity with respect to relapse-free survival (RFS) within these genetic-risk groups. Our training set included 306 adults on Alliance for Clinical Trials in Oncology studies with de novo CN-AML aged < 60 years who achieved a complete remission and for whom centrally reviewed cytogenetics, RNA-sequencing, and gene mutation data from diagnostic samples were available (Alliance trial A152010). To overcome deficiencies of the Cox proportional hazards model when long-term survivors are present, we developed a penalized semi-parametric mixture cure model (MCM) to predict RFS where RNA-sequencing data comprised the predictor space. To validate model performance, we employed an independent test set from the German Acute Myeloid Leukemia Cooperative Group (AMLCG) consisting of 40 de novo CN-AML patients aged < 60 years who achieved a complete remission and had RNA-sequencing of their pre-treatment sample. For the training set, there was a significant non-zero cure fraction (p = 0.019) with 28.5% of patients estimated to be cured. Our MCM included 112 genes associated with cure, or long-term RFS, and 87 genes associated with latency, or shorter-term time-to-relapse. The area under the curve and C-statistic were respectively, 0.947 and 0.783 for our training set and 0.837 and 0.718 for our test set. We identified a novel, prognostically relevant molecular signature in CN-AML, which allows identification of patient subgroups independent of 2022 ELN genetic-risk groups. Trial registration Data from companion studies CALGB 8461, 9665 and 20202 (trials registered at www.clinicaltrials.gov as, respectively, NCT00048958, NCT00899223, and NCT00900224) were obtained from Alliance for Clinical Trials in Oncology under data sharing study A152010. Data from the AMLCG 2008 trial was registered at www.clinicaltrials.gov as NCT01382147. Supplementary Information The online version contains supplementary material available at 10.1186/s13045-024-01553-6.

novo CN-AML patients aged <60 years who were treated in the AMLCG 2008 trial [7], achieved a CR (excluding patients who achieved CR with incomplete blood recovery), and had gene expression measured in their pre-treatment samples using the RNA-seq.
Mutation data for the AMLCG cohort were available for 19 genesto enforce consistency between the training and test sets, we restricted attention to these 19 gene mutations.All patients provided written informed consent for participation in the treatment studies.Institutional Review Board approval of all CALGB/Alliance and AMLCG protocols was obtained before any research was performed.

Training set RNA-seq sample preparation
The training set of de novo CN-AML patients aged <60 years who achieved complete remission was derived from a larger dataset that initially consisted of 893 samples from patients with abnormal AML or CN-AML, 29 CBF patients from the Nationwide cohort, and 49 patients from the Nationwide Disparities AML Project.Tumor derived RNA was subjected to 0.8x SPRI bead cleanup and size selection prior to DNase treatment and ribodepletion prior to using the NEBNext Ultra II Directional kit preparation.The library was constructed for whole transcriptome sequencing (RNA-seq).Paired-end reads were either 50-bp or 150-bp and were generated on the Illumina NovaSeq (Illumina, Inc.).Prior to alignment, 150-bp reads were trimmed to 50-bp to mitigate batch effects due to read length.Then reads were aligned to the human genome reference sequence (GRCh38).

Training set RNA-seq data pre-processing
Samples with < 10 million total counts and/or < 60% of total counts assigned to protein coding genes were removed from the dataset.Two batches were defined: as 893 samples

Clinical endpoints and statistical analysis
We estimated RFS as the time from CR until relapse, death, or last-follow-up, censoring for patients alive without relapse.Following the 2022 ELN guidelines [8], we considered both relapse and death as events when estimating RFS.Therefore, herein "cure" is synonymous with attaining long-term RFS.Prior to applying an MCM, we needed to verify we had sufficient follow-up [9,10] and that there was a significant nonzero cure fraction in our dataset.To this end, we examined the Kaplan-Meier survival curve to ascertain the plateau and estimated the time at which 95% of events should occur [11].We then performed a hypothesis test to detect whether there was a significant non-zero cure fraction for RFS using both the simulated non-parametric and the parametric approaches [11,12].
We were interested in investigating the effects of the covariates on our time-toevent outcome.We hypothesized that there was one subgroup consisting of patients immune to the event (those relapse-free, in other words, those who were cured) and another subgroup consisting of patients susceptible (termed susceptibles) to the event (relapse or death).We let p represent the proportion of susceptibles, so (1p) represents the proportion cured.The finite mixture model can be used to arrive at the overall RFS model, (), which is decomposed into the sum of the individual subgroup survival functions weighted by the proportion of the relevant subgroup size, or where  !() is the survival function for those cured while  " () is the survival function for those susceptible [12], We also note that for those cured, the survival function,  !(), at any timepoint is 1, which can be substituted into Eq. 1 to yield a simplified survival function.As previously mentioned, we were interested in estimating the effect of covariates on the outcome.Notice again that our outcome model (Eq. 1) consists of two components: the incidence component which models susceptible versus cured and the latency component which models time-to-event for those susceptible.It is possible that different covariates affect these different parts of our model, so we let x and w denote the covariates included in the incidence and latency components, respectively.Thus, our MCM (|, ) = .1 − ()/ + () " (, | = 1) (Eq.2) allows us to identify the effect of covariates on the likelihood of being susceptible to the event versus immune to the event (or cured) and the time-to-event (or latency) among those susceptible.The latency or time-to-event function for susceptibles can be modeled in different ways, using, for example, a parametric, a non-parametric, or semiparametric survival model.We choose to use the semi-parametric Cox model as it relaxes the parametric assumptions in the baseline latency distribution and therefore increases model flexibility over parametric MCMs [13].
Our goal was to identify a multivariable model using genomic features, namely genes having expression significantly associated with either the cure or latency.
Because few variable selection methods exist for high-dimensional MCMs, we developed a penalized semi-parametric MCM for high-dimensional datasets to identify prognostically relevant genes that can distinguish patients cured from patients susceptible to the event with lower or higher risk of relapse.Given our observed data and unknown parameters, we used the Expectation-Maximization (E-M) algorithm [14] where we divided the problem into logistic regression for cure status component and survival regression for susceptible patients to estimate the parameters for the latency component of the model.Specifically, we expressed the complete-data log-likelihood assuming the cure status is observed and included a LASSO penalty for the incidence covariates and a LASSO penalty for the penalized latency covariates.Initial values for the intercept for the incidence portion of the model and the baseline hazard for the latency portion of the model were obtained by fitting logistic regression and survival models, respectively, and all penalized coefficients were initialized to 0. In the Estimation step (E-step), we calculated the expected penalized complete-data log-likelihood with respect to the conditional distribution of cure given the current parameter estimates and observed data, where we replaced the unknown cure status indicator with its estimated proportion.In the Maximization step (M-step), we estimated the parameters by expressing the objective function as the sum of two separate functions corresponding to the incidence and latency components of the model.The incidence component was estimated as a penalized logistic regression with a fractional response.
The latency component involves the parameters and the baseline latency function, so the optimization problem turned into a penalized Cox regression with an offset term.
The baseline latency was estimated by first obtaining Breslow's estimator for the cumulative baseline hazard though inclusion of an extra term in the denominator and applying the Weibull tail completion method when the last observation is censored.The E-M algorithm alternated between an E-step and a M-step until convergence.

Mixture cure model applied to the training and test sets
RNA-seq expression values were candidate covariates in our semi-parametric penalized MCM to identify a parsimonious list of transcripts associated with cure or latency.Pre-processing steps applied to the RNA-seq data for the training set are described in the Training Set RNA-seq Data Pre-processing section.The AMLCG RNAseq data were processed as previously described [6].We combined the training and test sets then voom normalized [15] and applied ComBat [16] to batch-correct RNA-Seq data, then filtered the data by applying thresholds to gene-level means (>2) and standard deviations (>0.75) of our training set, leaving 4,770 genes for model fitting.
Ten-fold cross validation was repeated 100 times to choose the optimal model using the training set, using the cross-validated area under the receiver operating characteristic curve (AUC) to select the penalty for the incidence portion of the model and the crossvalidated C-statistic to select the penalty for the latency portion of the model.After fitting our penalized semi-parametric MCM to the training set, we applied the fitted MCM to the test set.

Results
Demographic, clinical, and select gene mutation data for patients in our training and test sets are presented in Supplementary Table 4.The training set was more balanced with respect to sex than the test set (p=0.028), and the test set patients had significantly higher percentage of BM blasts (p=0.021),whereas the training set patients had significantly higher percentage of blasts in the blood (p<0.001).Among 19 gene mutations examined, there were differences in the frequencies of NRAS (p=0.025) and WT1 (p<0.001) mutations between the training and test sets, whereas there was no significant difference with respect to the patient assignment into the 2022 ELN geneticrisk groups (p=0.749).

Model development
There was a significant non-zero cure fraction for RFS in the training set (p<0.0001) with the estimated percentage of patients cured at 28.5% (Figure 1A).This demonstrates heterogeneity of the outcomes of patients with CN-AML, and that our data are likely composed of two subgroups: one subgroup of patients who are susceptible to the event (relapse or death) and the second of patients who are "cured." Prior to applying our penalized semi-parametric MCM, we verified that we had sufficient follow-up data and that a significant non-zero cure fraction was present in our training set. Figure 1A depicts RFS for our training set, estimated using the Kaplan-Meier method.The time at which 95% of events should have occurred was estimated to be 5.3 years and the 95 th percentile of observed event times was 4.92 years, whereas the median follow-up among patients alive at their last follow-up was 9.78 years (range, 0.61 to 21.23 years).This, together with the long plateau in the RFS curve, which does not drop down to zero (Figure 1A), provides empirical evidence of sufficient follow-up.

Examination of model predictions in conjunction with known prognostic factors
In the training set, patients predicted to be cured and those predicted to be susceptible with lower risk of relapse or death had lower white blood cell counts (WBC; p=0.02) and percentages of BM blasts (p=0.018)than patients predicted to be susceptible with higher risk of relapse or death (Supplementary Table 5).There were also significant differences among these three groups with respect to DNMT3A (p<0.001),CEPBA bzip (p=0.035), and RUNX1 (p=0.036)mutations and the presence of tyrosine kinase domain mutations in the FLT3 gene (FLT3-TKD; p=0.005) and internal tandem duplications of the FLT3 gene (FLT3-ITD; p<0.001), as well in categorization to the 2022 ELN genetic-risk groups (p<0.001)(Supplementary Table 5).However, when we examined the mutational status of genes having mutations that occurred in at least 30 patients, which included NPM1, CEPBA bzip , DNMT3A, FLT3-ITD, FLT3-TKD, IDH1, IDH2, NRAS, and PTPN11, we found significant differences between patients predicted to be cured versus those predicted to be susceptible and between those predicted to be susceptible with lower and higher risk of relapse or death (Supplementary .This indicated that our model improved upon inclusion of these mutation data.

Model validation
Patients in our independent test set were diagnosed with AML more recently and therefore their median follow-up of 4.05 years (range, 0.11 to 6.27 years) among patients alive at their last follow-up was shorter than follow-up of patients in our training set (Figure 1D).Thirteen patients underwent an allogeneic HSCT in first CR and were thus censored at the date of HSCT, which also differs from our training set.While the Kaplan-Meier curve does not necessarily show that a cure fraction is present (Figure 1D), because this patient cohort consists of younger CN-AML patients with RNA-seq data available, we considered it a useful proxy test set.In the test set, there were significant differences among patients predicted to be cured versus those predicted to be susceptible with lower and higher risk of relapse or death for CEBPA bZIP (p=0.020),GATA2 (p=0.02), and WT1 (p=0.017)mutations (Supplementary Table 6).

Biological relevance of genes in our MCM signature
Topp Gene (accessed on August 18, 2023) was used to provide biological insights with respect to genes included in our MCM signature.Several of them were identified in previously published studies.Fourteen genes, namely, PRTN3, SCRN1, ARHGEF17, PROM1, NRXN2, SERPING1, AK4, ITM2C, ALDH2, FHL1, MDFIC, CD34, ELANE and EMP1, were differentially expressed between AML patients with NPM1 mutations and those with wild-type NPM1 (FDR<0.0001)[17], whereas 11 genes, CDKN2C, PRR5L, PITPNC1, FAAH, PROM1, ITM2C, USP13, MDFIC, RHD, GYPA and CD34, were downregulated in AML patients with cytoplasmic NPM1 localization compared with those without cytoplasmic NPM1 localization (FDR=0.0001)[18].We examined the performance of our model in NPM1-mutated patients and there was a significant difference in RFS between patients predicted to be cured versus those predicted to be susceptible (p<0.0001) and between those predicted to be susceptible having higher versus lower risk of relapse or death (p<0.0001)(Supplemental Figure 3).
Four genes, IL18, PRTN3, ALDH2 and HDC, were previously identified to be in the Cebpa signaling network that was identified from differentially expressed genes in leukemic cells from murine spleens comparing Cbfb+/MYH11d179-221 and Cbfb+/MYH11 chimeras (FDR=0.009)[19].When examining the performance of our model in CEBPA bZIP -mutated patients, there was a significant difference in RFS between patients predicted to be cured versus those predicted to be susceptible (p<0.0001) and between those predicted to be susceptible having higher versus lower risk of relapse or death (p<0.0001)(Supplemental Figure 4).
In prior studies, development of prognostic models derived using high-throughput gene-expression data selected genes either by assigning patients to two groups, those who remained in CR ≥3 years and those who relapsed, or predicted attainment of CR [49].Others fit univariate Cox proportional hazards models followed by (1) k-means clustering to identify naturally occurring groups related to time-to-event outcome [40], (2) principal component analysis [50], or (3) a multivariable Cox model [51,52], or validated previously identified gene clustering using compound covariate prediction by modeling alive versus dead at last follow-up rather than fitting a time-to-event model [41].However, in several publications, the Kaplan-Meier estimate of survival demonstrates a long plateau that does not drop down to zero despite long follow-up, suggesting the existences of a subgroup of CN-AML patients who enjoy long-term RFS that approaches their population expected survival.In fact, it has been suggested that AML patients attaining 3-year RFS can be considered "potentially cured" [2].Moreover, in a study of 1,068 AML patients who achieved a CR, the failure rate declined over time, thus the proportional hazards assumption was not satisfied [2].Our multivariable mixture cure model overcomes problems encountered when the proportional hazards assumption is violated due to the presence of a cured subset and permits modeling a censored time-to-event outcome rather than defining arbitrary groups such as those who remained in CR ≥3 years and those who relapsed.
A recent study sought to identify molecular markers associated with long-term RFS in CN-AML patients by employing a case-control design whereby single-cell RNA sequencing profiles of 28 CN-AML patients remaining in first CR for at least five years were evaluated against 31 well-matched CN-AML patients who relapsed within two years of achieving their first CR [53].Others developed a prognostic model for patients with CN-AML that was based on the mutational status of select genes (CEBPA bZIP ,

FLT3-ITD, NPM1) and clinical characteristics (age, WBC, Eastern Cooperative
Oncology Group performance status) instead of using high-throughput gene-expression data [54].They also fit a multivariable Cox PH model, which is known to yield inaccurate estimates when a cure fraction is present.
Therefore, instead of employing a case-control design which requires a priori arbitrary timepoints for patient classification, we employed a novel time-to-event model and established a predictive algorithm for identifying long-term survivors using RNA-seq data evaluated in samples collected at the time of diagnosis.Our penalized MCM identified genes associated with cure and latency in younger CN-AML patients, which performed well when applied to an independent test set.

Supplementary references
from the Alliance for Clinical Trials in Oncology (Alliance; consisting of abnormal AML and CN-AML patients) and 78 samples from two Nationwide Projects (CBF and Disparities AML Projects).Genes having a mean counts-per-million ≥ 0.5 in at least one batch were retainedhttps://bitbucket.org/anthakki/poibm/src/master/ on March 2, 2023) was used for batch correction.The poibm.fit function was called with the count data in the Nationwide batch (source) and Alliance batch (target space).The phenotype parameters were set to rhoX = 0.25 and rhoY=1; parameter values controlling the E-M estimation were set to max.iter = 100 and max.resets = 20.The poibm.apply function was used to apply the fits (corrections) to the Nationwide sample counts, followed by rounding to the nearest integer.